library(tidyverse)
process_atc_data <- function(year, url) {
# Read the dataset directly from the URL
df <- read_delim(url, delim = ";") # Assuming the file is semicolon-delimited
# Attach column names
colnames(df) <- c("atc","year","sector","region","sex","agegroup","count_persons",
"count_persons_per1kpop","turnover","reimbursement",
"sold_amount", "sold_amount_1kpop_day", "personreferabledata_perc")
# the column with missing values messes stuff up
colnames(df)[is.na(colnames(df))] <- "missing_name"
df <- df |> select(-missing_name)
# Filter the dataset to get only what I need
df_filtered <- df %>%
filter(str_detect(agegroup, "-") & str_starts(atc, "A10")) |>
select(atc,year,agegroup,turnover)
# Define drug classes
DCs <- c("A10BJ", "A10BK", "A10BH", "A10BA", "A10BB", "A10BG", "A10BX", "A10A")
# Further filter and add a new drug class variable
df_filtered <- df_filtered %>%
filter(atc %in% DCs) %>%
mutate(
DC = case_when(
atc == "A10BJ" ~ "GLP1",
atc == "A10BK" ~ "SGLT2",
atc == "A10BH" ~ "DPP4",
atc == "A10BA" ~ "Metformin",
atc == "A10BB" ~ "SU",
atc == "A10BG" ~ "Thiazo",
atc == "A10BX" ~ "Others",
atc == "A10A" ~ "INSULIN",
TRUE ~ NA_character_
),
turnover1000k = turnover / 1000 # Turnover in 1000k units
)
# Assign the names to datasets and plots using the year variable
df_name <- paste0("df_", year)
assign(df_name, df_filtered, envir = .GlobalEnv)
# Return the processed data and the plots
return(list(data = df_filtered)) #, total_turnover_plot = p1, proportional_turnover_plot = p2
}Purpose
I want to follow up the previous post, and the promise of showing how to put it all in to a function.
As has been written before here, if you do something several times in R, you might as well write a function.
So I am going to condense the code I wrote for the purpose of presenting turnover for some diabetes drug classes in the previous post, and turn it in to a function.
Turning it into a function
So, I don’t need to take all of the code from the previous post, just what I need for the function. I also slim the data down a bit with the select() function.
I have not brought over the code for making the plots, because I aim to use the facet_wrap() from ggplot2 to make some plots where its easier to see how the turnover has developed over the years.
To achieve that, I need to append the datasets I can create with the above function.
Creating the datasets for the plots
Now, I have the list of datasets and their corresponding URLs from the last post. There should be a way to automate fetching the URLs, for example by searching the site and matching the line with the dataset name and the line with the link to download it.
For now, I just use what I already found.
# get data from years 2016-2023
process_atc_data(2016,"https://medstat.dk/da/download/file/MjAxNl9hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1861363 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 88 dbl (9): 2016, 0...3, 0...4, ...7, ...8, 1925435,
1159429, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 619 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2016 00-17 9354 INSULIN 9.35
2 A10A 2016 00-17 10103 INSULIN 10.1
3 A10A 2016 00-17 19456 INSULIN 19.5
4 A10A 2016 18-24 7880 INSULIN 7.88
5 A10A 2016 18-24 10883 INSULIN 10.9
6 A10A 2016 18-24 18763 INSULIN 18.8
7 A10A 2016 25-44 27658 INSULIN 27.7
8 A10A 2016 25-44 41457 INSULIN 41.5
9 A10A 2016 25-44 69115 INSULIN 69.1
10 A10A 2016 45-64 67858 INSULIN 67.9
# ℹ 609 more rows
process_atc_data(2017,"https://medstat.dk/da/download/file/MjAxN19hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1851679 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 88 dbl (9): 2017, 0...3, 0...4, ...7, ...8, 2008320,
1212005, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 621 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2017 00-17 9197 INSULIN 9.20
2 A10A 2017 00-17 10691 INSULIN 10.7
3 A10A 2017 00-17 19888 INSULIN 19.9
4 A10A 2017 18-24 8042 INSULIN 8.04
5 A10A 2017 18-24 10468 INSULIN 10.5
6 A10A 2017 18-24 18509 INSULIN 18.5
7 A10A 2017 25-44 27295 INSULIN 27.3
8 A10A 2017 25-44 41429 INSULIN 41.4
9 A10A 2017 25-44 68724 INSULIN 68.7
10 A10A 2017 45-64 70076 INSULIN 70.1
# ℹ 611 more rows
process_atc_data(2018,"https://medstat.dk/da/download/file/MjAxOF9hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1841721 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 88 dbl (9): 2018, 0...3, 0...4, ...7, ...8, 2154536,
1311885, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 607 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2018 00-17 9467 INSULIN 9.47
2 A10A 2018 00-17 11006 INSULIN 11.0
3 A10A 2018 00-17 20473 INSULIN 20.5
4 A10A 2018 18-24 8215 INSULIN 8.22
5 A10A 2018 18-24 10421 INSULIN 10.4
6 A10A 2018 18-24 18636 INSULIN 18.6
7 A10A 2018 25-44 28126 INSULIN 28.1
8 A10A 2018 25-44 42639 INSULIN 42.6
9 A10A 2018 25-44 70765 INSULIN 70.8
10 A10A 2018 45-64 72902 INSULIN 72.9
# ℹ 597 more rows
process_atc_data(2019,"https://medstat.dk/da/download/file/MjAxOV9hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1833788 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 89 dbl (9): 2019, 0...3, 0...4, ...7, ...8, 2395766,
1467188, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 601 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2019 00-17 9599 INSULIN 9.60
2 A10A 2019 00-17 11231 INSULIN 11.2
3 A10A 2019 00-17 20830 INSULIN 20.8
4 A10A 2019 18-24 7867 INSULIN 7.87
5 A10A 2019 18-24 10313 INSULIN 10.3
6 A10A 2019 18-24 18180 INSULIN 18.2
7 A10A 2019 25-44 26903 INSULIN 26.9
8 A10A 2019 25-44 41402 INSULIN 41.4
9 A10A 2019 25-44 68305 INSULIN 68.3
10 A10A 2019 45-64 69854 INSULIN 69.9
# ℹ 591 more rows
process_atc_data(2020,"https://medstat.dk/da/download/file/MjAyMF9hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1817857 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 90 dbl (9): 2020, 0...3, 0...4, ...7, ...8, 2456459,
1511380, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 609 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2020 00-17 9272 INSULIN 9.27
2 A10A 2020 00-17 10962 INSULIN 11.0
3 A10A 2020 00-17 20234 INSULIN 20.2
4 A10A 2020 18-24 7564 INSULIN 7.56
5 A10A 2020 18-24 9789 INSULIN 9.79
6 A10A 2020 18-24 17353 INSULIN 17.4
7 A10A 2020 25-44 26183 INSULIN 26.2
8 A10A 2020 25-44 39902 INSULIN 39.9
9 A10A 2020 25-44 66085 INSULIN 66.1
10 A10A 2020 45-64 66259 INSULIN 66.3
# ℹ 599 more rows
process_atc_data(2021,"https://medstat.dk/da/download/file/MjAyMV9hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1802119 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 90 dbl (9): 2021, 0...3, 0...4, ...7, ...8, 2719403,
1650909, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 613 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2021 00-17 9202 INSULIN 9.20
2 A10A 2021 00-17 10907 INSULIN 10.9
3 A10A 2021 00-17 20109 INSULIN 20.1
4 A10A 2021 18-24 7484 INSULIN 7.48
5 A10A 2021 18-24 9554 INSULIN 9.55
6 A10A 2021 18-24 17038 INSULIN 17.0
7 A10A 2021 25-44 24773 INSULIN 24.8
8 A10A 2021 25-44 38364 INSULIN 38.4
9 A10A 2021 25-44 63137 INSULIN 63.1
10 A10A 2021 45-64 61082 INSULIN 61.1
# ℹ 603 more rows
process_atc_data(2022,"https://medstat.dk/da/download/file/MjAyMl9hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1814587 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 91 dbl (9): 2022, 0...3, 0...4, ...7, ...8, 3112077,
1863246, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 609 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2022 00-17 7502 INSULIN 7.50
2 A10A 2022 00-17 8733 INSULIN 8.73
3 A10A 2022 00-17 16235 INSULIN 16.2
4 A10A 2022 18-24 5899 INSULIN 5.90
5 A10A 2022 18-24 7943 INSULIN 7.94
6 A10A 2022 18-24 13841 INSULIN 13.8
7 A10A 2022 25-44 19962 INSULIN 20.0
8 A10A 2022 25-44 30580 INSULIN 30.6
9 A10A 2022 25-44 50543 INSULIN 50.5
10 A10A 2022 45-64 49506 INSULIN 49.5
# ℹ 599 more rows
process_atc_data(2023,"https://medstat.dk/da/download/file/MjAyM19hdGNfY29kZV9kYXRhLnR4dA==")New names:
Rows: 1806544 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(4): A...1, A...5, A...6, 94 dbl (9): 2023, 0...3, 0...4, ...7, ...8, 5009988,
2661589, ...11, ...12 lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `A` -> `A...1`
• `0` -> `0...3`
• `0` -> `0...4`
• `A` -> `A...5`
• `A` -> `A...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...11`
• `` -> `...12`
• `` -> `...14`
$data
# A tibble: 563 × 6
atc year agegroup turnover DC turnover1000k
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 A10A 2023 00-17 7510 INSULIN 7.51
2 A10A 2023 00-17 8599 INSULIN 8.60
3 A10A 2023 00-17 16109 INSULIN 16.1
4 A10A 2023 18-24 5980 INSULIN 5.98
5 A10A 2023 18-24 8058 INSULIN 8.06
6 A10A 2023 18-24 14039 INSULIN 14.0
7 A10A 2023 25-44 19680 INSULIN 19.7
8 A10A 2023 25-44 30630 INSULIN 30.6
9 A10A 2023 25-44 50311 INSULIN 50.3
10 A10A 2023 45-64 46624 INSULIN 46.6
# ℹ 553 more rows
Actually, I could just use the purr::map() function.
# Load the purr library
library(purrr)
# Define a list of URLs corresponding to each year (2016 to 2023)
urls <- list(
"2016" = "https://medstat.dk/da/download/file/MjAxNl9hdGNfY29kZV9kYXRhLnR4dA==",
"2017" = "https://medstat.dk/da/download/file/MjAxN19hdGNfY29kZV9kYXRhLnR4dA==",
"2018" = "https://medstat.dk/da/download/file/MjAxOF9hdGNfY29kZV9kYXRhLnR4dA==",
"2019" = "https://medstat.dk/da/download/file/MjAxOV9hdGNfY29kZV9kYXRhLnR4dA==",
"2020" = "https://medstat.dk/da/download/file/MjAyMF9hdGNfY29kZV9kYXRhLnR4dA==",
"2021" = "https://medstat.dk/da/download/file/MjAyMV9hdGNfY29kZV9kYXRhLnR4dA==",
"2022" = "https://medstat.dk/da/download/file/MjAyMl9hdGNfY29kZV9kYXRhLnR4dA==",
"2023" = "https://medstat.dk/da/download/file/MjAyM19hdGNfY29kZV9kYXRhLnR4dA=="
)
# Vector of years you want to process
years <- 2016:2023
# Use map to iterate over years and URLs
df_allyears <- map2(years, urls, process_atc_data)
# Example: Access the result for a specific year
df_2023 <- df_allyears[[1]] # The first year of the bunch, 2016
# Combine all the datasets into one - and avoiding getting "$" in all the the variable names while doing it
df_allyears <- bind_rows(df_allyears)
df_allyears <- map(df_allyears, as_tibble)
df_allyears <- bind_rows(df_allyears)This is using even less space (if you look past the part about defining a list of URLs). I swear, I do not know why i need to repeat the bind rows function for it to work.
Making the plots
Now that we have a combined dataset, df_allyears, with all the data to recreate the plots from the last post, we can try to make a plot. Let’s see if facet_wrap() makes a reasonable graph with all eight years.
# Making the base graph to add unto
p_base <- ggplot(df_allyears, aes(x = agegroup, y = turnover1000k, fill = DC)) +
labs(caption = "Source: own calculations based on data from medstat.dk via the Danish Health Data Authority")
# Just seeing how it looks with ALL data summarised over the years
p_base + geom_bar(stat = "identity")p_base + geom_bar(stat = "identity", position = "fill")# Testing facet_wrap on the total turnover
p_base +
geom_bar(stat = "identity") +
facet_wrap(~year)p_base +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~year)Now, that is a LOT of information squeezed down on little space. So lets subset it and do the facetwrap() on four years at a time.
# Subset the data
df_early <- df_allyears |>
filter(year < 2020)
df_late <- df_allyears |>
filter(year >= 2020)
# Now for the graphs
# New bases
p_base_early <- ggplot(df_early, aes(x = agegroup, y = turnover1000k, fill = DC)) +
labs(caption = "Source: own calculations based on data from medstat.dk via the Danish Health Data Authority")
p_base_late <- ggplot(df_late, aes(x = agegroup, y = turnover1000k, fill = DC)) +
labs(caption = "Source: own calculations based on data from medstat.dk via the Danish Health Data Authority")
# New graphs
# Early
p_base_early +
geom_bar(stat = "identity") +
facet_wrap(~year)p_base_early +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~year) # Late
p_base_late +
geom_bar(stat = "identity") +
facet_wrap(~year)p_base_late +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~year)Ok, but we can do this better, making it more easily comparable by having the same y-axis on both the early and late dataset.
p_base_early <- ggplot(df_early, aes(x = agegroup, y = turnover1000k, fill = DC)) +
labs(caption = "Source: own calculations based on data from medstat.dk via the Danish Health Data Authority")
p_base_late <- ggplot(df_late, aes(x = agegroup, y = turnover1000k, fill = DC)) +
labs(caption = "Source: own calculations based on data from medstat.dk via the Danish Health Data Authority")
# New graphs
# Early
p_base_early +
geom_bar(stat = "identity") +
facet_wrap(~year) + ylim(0,7500)p_base_early +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~year) # Late
p_base_late +
geom_bar(stat = "identity") +
facet_wrap(~year) + ylim(0,7500)p_base_late +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~year) +
labs(y = "", x = "")ggsave("thumbnail.jpg", plot = last_plot(), width = 6, height = 4) # saving p_base_late as thumbnailCommenting on the output
So, the graphs overwhelmingly show a tremendous increase in the spending on GLP1.
Where insulin in the early period of 2016-2019 rivaled or was higher than GLP1, in the late period, GLP1 completely overshadows all other drugs within the chosen classes.
Between 2022 and 2023 there seem to be a doubling of the turnover. Turnover typically represents the overall revenue generated in the pharmacy sector.
This can mean both increased spending, and increased prices. We will look at that in the next post.